SGDR: Stochastic Gradient Descent with Warm Restarts
نویسندگان
چکیده
Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on CIFAR10 and CIFAR-100 datasets where we demonstrate new state-of-the-art results below 4% and 19%, respectively. Our source code is available at https://github.com/loshchil/SGDR.
منابع مشابه
Fixing Weight Decay Regularization in Adam
L2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam. While common deep learning frameworks of these algorithms implement L2 regularization (often calling it “weight decay” in what may be misleading due to the inequi...
متن کاملCSE 599 i : Online and Adaptive Machine Learning Winter 2018 Lecture 6 : Non - stochastic best arm identification
Example 1. Imagine that we are solving a non-convex optimization problem on some (multivariate) function f using gradient descent. Recall that gradient descent converges to local minima. Because non-convex functions may have multiple minima, we cannot guarantee that gradient descent will converge to the global minimum. To resolve this issue, we will use random restarts, the process of starting ...
متن کاملIdentification of Multiple Input-multiple Output Non-linear System Cement Rotary Kiln using Stochastic Gradient-based Rough-neural Network
Because of the existing interactions among the variables of a multiple input-multiple output (MIMO) nonlinear system, its identification is a difficult task, particularly in the presence of uncertainties. Cement rotary kiln (CRK) is a MIMO nonlinear system in the cement factory with a complicated mechanism and uncertain disturbances. The identification of CRK is very important for different pur...
متن کاملTexture analysis with the Volterra model using conjugate gradient optimisation
Texture is an important characteristic di erentiating objects or regions of interest in an image. A variety of approaches to texture analysis has previously been proposed; the approach in this paper is from the stochastic eld, whereby a two-dimensional Volterra model is used to represent a texture eld. Statistical moments are introduced as a means of determining the Volterra model generator for...
متن کاملPrimal-Dual Projected Gradient Algorithms for Extended Linear-Quadratic Programming
Many large-scale problems in dynamic and stochastic optimization can be modeled with extended linear-quadratic programming, which admits penalty terms and treats them through duality. In general the objective functions in such problems are only piecewise smooth and must be minimized or maximized relative to polyhedral sets of high dimensionality. This paper proposes a new class of numerical met...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016